HappyDB: Happy Moments Exploration - How is happiness expressed in words, and around which topics?

Yoojin Lee (yl4477)

Introduction

The subjective emotion of happiness is expressed differently by each individual. Using HappyDB, I want to explore which words people with certain demographic characteristics use to express their happiness, and through which topics. I also want to investigate whether the intensity of happiness differs between these groups.

Dataset

HappyDB is a corpus of 100,000+ crowd-sourced happy moments. The goal of the corpus is to advance the state of the art of understanding the causes of happiness that can be gleaned from text.

HappyDB is a large-scale collection of happy moments gathered over 3 months on Amazon Mechanical Turk (MTurk). For every task, MTurk workers were asked to describe 3 happy moments from the past 24 hours (or the past 3 months).

Let's get started!

Importing necessary packages

1. Data preparation and basic analysis

1.1 Loading the HappyDB

The dataset already includes 7 predicted categories, but these alone are not enough to understand the happy moments.

1.2 Preliminary Cleaning of text

For the cleaned happy moments, the cleaning process follows the steps below:

  1. Convert the text to lowercase
  2. Remove punctuation (characters such as ".", ",", "!", "@", "#", etc.)
  3. Remove numbers
  4. Remove leading and trailing spaces
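
The four steps above can be sketched as a single function. This is a minimal illustration using only the standard library; the function name `clean_moment` and the example sentence are assumptions, not taken from the notebook's code.

```python
import re
import string

def clean_moment(text: str) -> str:
    """Apply the four cleaning steps to one happy moment."""
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    text = re.sub(r"\d+", "", text)                                   # 3. remove numbers
    return text.strip()                                               # 4. trim leading/trailing space

# Hypothetical example sentence
example = "  I ran 5 miles today, and it felt GREAT!  "
print(clean_moment(example))
```

Note that removing a number can leave two consecutive spaces inside the sentence; the listed steps only trim the ends, so this sketch does not collapse interior whitespace.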

Let's look at the top 10 words by counting the occurrences of each word.
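
A word count of this kind can be sketched with `collections.Counter`. The three sentences below are hypothetical stand-ins for rows of the cleaned_hm column, and whitespace splitting is an assumed tokenization.

```python
from collections import Counter

# Hypothetical stand-ins for rows of the cleaned_hm column
cleaned_moments = [
    "i had dinner with my family",
    "my friend visited me",
    "i played with my dog",
]

# Split each moment on whitespace and count every word across all moments
word_counts = Counter(
    word for moment in cleaned_moments for word in moment.split()
)

# The 10 most frequent words as (word, count) pairs
top_words = word_counts.most_common(10)
print(top_words)
```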

Here is the box plot of category counts for the cleaned_hm column, but this alone is not enough to understand the happiness of individuals.

2. Text Mining

2.1 WordClouds

2.1.1 Word Frequency Analysis

I conduct a word frequency analysis of the text. Some noise words, such as "happy", "yesterday", and "today", are not very informative, so I remove them before building the word cloud. I also remove other unnecessary words from the analysis and reduce verbs to their base form.
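
The noise removal step might look like the sketch below. The noise list contains only the three words named in the text (the notebook's full list is not shown), and `remove_noise` is a hypothetical helper. Reducing verbs to their base form is a separate step that is not covered here; a lemmatizer such as NLTK's `WordNetLemmatizer` is one possible tool for it.

```python
# Assumed noise list: only the words named in the text above
NOISE_WORDS = {"happy", "yesterday", "today"}

def remove_noise(tokens, noise=NOISE_WORDS):
    """Keep only the tokens that are not in the noise set."""
    return [tok for tok in tokens if tok not in noise]

# Hypothetical cleaned happy moment, split on whitespace
tokens = "i was happy today because my sister visited".split()
filtered = remove_noise(tokens)
print(filtered)
```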

2.1.2 Associated Word Analysis

Through the word cloud above, I examined which words appear frequently. Next, I pick the top 5 most frequent noun keywords and create a word cloud from the co-occurrence frequencies of the words related to these keywords.
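
One simple way to define co-occurrence is to count the words that appear in the same happy moment as a keyword. The sketch below assumes that definition; the function name `cooccurring_words` and the token lists are hypothetical.

```python
from collections import Counter

def cooccurring_words(token_lists, keyword):
    """Count the words that appear in the same moment as `keyword`."""
    counts = Counter()
    for tokens in token_lists:
        if keyword in tokens:
            # Count each distinct word once per moment, excluding the keyword
            counts.update(tok for tok in set(tokens) if tok != keyword)
    return counts

# Hypothetical tokenized happy moments
moments = [
    ["friend", "dinner", "birthday"],
    ["friend", "movie", "dinner"],
    ["family", "dinner", "home"],
]
friend_counts = cooccurring_words(moments, "friend")
print(friend_counts.most_common(3))
```

A frequency mapping like `friend_counts` is the kind of input that the `wordcloud` package's `WordCloud.generate_from_frequencies` accepts, if that package is used for the plots.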

2.2 Sentiment Analysis

Using HappyDB, I perform sentiment analysis on each happy-moment sentence. Because the dataset includes demographic information, I also analyze whether emotional intensity differs across demographic groups.

Let's visualize the sentiment distribution with a histogram:

To analyze emotional intensity, we can make use of the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool which is specifically attuned to sentiments expressed in social media.

It is a lexicon and rule-based sentiment analysis tool that is used to determine polarity (positive, negative, neutral) and also intensity (strength) of the sentiment.

I am going to use VADER to analyze emotional intensity.

VADER's compound score can be used as an emotional intensity metric. It ranges from -1 (most extreme negative) to 1 (most extreme positive).
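
To give a sense of why the compound score stays within that range: to my understanding, VADER's reference implementation sums the adjusted valence of the words in a sentence and then squashes that sum with x / sqrt(x^2 + alpha), using alpha = 15. The sketch below reproduces only this normalization step under that assumption; it is not a replacement for the VADER library, and the input values are made up.

```python
import math

def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Hypothetical valence sums: larger sums approach, but never reach, 1 or -1
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize(total), 3))
```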

With this, let's analyze emotional intensity based on demographic information.
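
Each comparison in this section amounts to averaging the compound score within a demographic group. A minimal standard-library sketch, assuming (group, score) pairs as input; the group labels and scores are invented for illustration and are not results from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group label, compound score) pairs, not real results
records = [
    ("f", 0.75), ("m", 0.5), ("f", 0.25), ("m", 0.0),
]

def mean_intensity_by_group(pairs):
    """Average the compound score within each demographic group."""
    scores = defaultdict(list)
    for group, score in pairs:
        scores[group].append(score)
    return {group: mean(values) for group, values in scores.items()}

by_gender = mean_intensity_by_group(records)
print(by_gender)
```

The same function applies to parenthood, country, or marital status by changing which column supplies the group label.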

By Gender:

By Parenthood:

By Country:

This is hard to interpret, so I'm going to show the average emotional intensity by country on a world map.

By Marital Status:

From the results above, it is hard to identify any significant factor through sentiment analysis alone, so I decided to do topic modeling to gain a deeper understanding.

2.4 LDA Topic Modeling

Although we already know the 7 topic categories, I use LDA topic modeling to get more detailed insight into the words used to express happiness.

Preparing data for LDA:

Since I've already tokenized and cleaned the data, I just created a dictionary and a corpus for the LDA model.
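
For readers unfamiliar with these two structures: the dictionary maps each distinct token to an integer id, and the corpus stores every document as a bag of words, that is, a list of (token id, count) pairs. The sketch below builds both with the standard library only. Libraries such as gensim (`corpora.Dictionary` and its `doc2bow` method) produce a similar structure, although the ids they assign may differ; the helper names and documents here are hypothetical.

```python
from collections import Counter

def build_dictionary(token_lists):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Turn one document into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical tokenized and cleaned happy moments
docs = [["dinner", "friend", "dinner"], ["friend", "movie"]]
token2id = build_dictionary(docs)
corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(corpus)
```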

Let's build the LDA model. I'm going to build a model with 5 topics.

Before visualizing the LDA model with tools, let's briefly view the topics!

2.5 Visualizing with topicwizard

Although pyLDAvis provides a great visual investigation of topic models, I wanted to investigate the complex relations between topics. So, I used topicwizard to create a more advanced topic modeling visualization.

By clicking 'Click this link to open topicwizard' above, you can see results like the images below:

Topicwizard_img1.png

Topicwizard_img2.png

Topicwizard_img3.png

Conclusion

The analysis of HappyDB provided an opportunity to approach the complex nature of happiness based on demographic information. Although no significant differences were found in the process, analyzing frequently appearing words through a wordcloud and seeing which associated words came up allowed us to see which keywords were often expressed together with happy moments. Additionally, through topic modeling, we could identify which words frequently appeared based on specific topics, and determine the keywords associated with those topics.